Module 02 - Systems Programming with Python

The "Python Is Slow for Systems Work" Myth

In 2019, Cloudflare published an engineering post describing their DNS resolver. The performance-critical path - receiving UDP datagrams, parsing wire-format DNS, and writing responses - was implemented in Python. It handled 1 million queries per second on commodity hardware. Not because Python's interpreter is fast (it isn't), but because the engineers understood where Python actually spends its time: the interpreter overhead exists only in userspace Python bytecode. The moment you call into the kernel - via a socket recv, a file read, or a mmap - you are executing C code compiled to machine instructions. Python is the orchestrator. The OS does the heavy lifting.

This module dismantles five specific misconceptions engineers carry into Python systems work:

Myth 1: "You can't write a high-performance server in Python." You can. For I/O-bound workloads, Nginx-style event-driven concurrency is achievable using epoll on Linux (or kqueue on macOS). The asyncio event loop is built on exactly this substrate through the selectors module; io_uring is not exposed by the standard library and needs third-party bindings. Gunicorn, Uvicorn, and Twisted all prove this in production daily.
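The event-driven model described above can be sketched with the standard selectors module, which picks epoll on Linux and kqueue on macOS. This is a minimal illustration, not a production server: the function name run_echo_server and the idle_timeout parameter are assumptions made for the example, and partial writes are deliberately ignored.

```python
import selectors
import socket


def run_echo_server(listener: socket.socket, idle_timeout: float = 5.0) -> int:
    """Echo bytes back to every client; return after `idle_timeout` seconds of silence.

    Returns the number of connections accepted. The idle exit exists only so
    the sketch terminates on its own; a real server would loop forever.
    """
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on macOS/BSD
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)
    accepted = 0
    try:
        while True:
            events = sel.select(timeout=idle_timeout)
            if not events:
                return accepted  # nothing became readable in time: stop
            for key, _mask in events:
                sock = key.fileobj
                if sock is listener:
                    conn, _addr = listener.accept()
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ)
                    accepted += 1
                    continue
                data = sock.recv(4096)
                if data:
                    # Simplification: assumes the kernel send buffer has room.
                    sock.sendall(data)
                else:
                    sel.unregister(sock)  # empty read means the peer closed
                    sock.close()
    finally:
        for key in list(sel.get_map().values()):
            key.fileobj.close()
        sel.close()


if __name__ == "__main__":
    server = socket.create_server(("127.0.0.1", 0))
    print("listening on port", server.getsockname()[1])
    run_echo_server(server, idle_timeout=5.0)
```

One thread and one selector handle every connection. The loop only ever touches sockets the kernel has already reported as readable, which is why the Python overhead per event stays small.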

Myth 2: "Python can't talk to the OS properly." The os, signal, resource, and fcntl modules expose a large part of the POSIX API. os.fork(), os.execve(), os.waitpid(), and signal.signal() are thin wrappers around the corresponding C library calls (CPython installs handlers with sigaction(2) where it is available). The Python interpreter is itself a POSIX process and behaves like one.
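As a small illustration of how thin these wrappers are, the classic fork/exec/wait sequence looks almost identical to its C counterpart. The helper name spawn_and_wait is hypothetical, and the sketch assumes a POSIX system.

```python
import os
import sys


def spawn_and_wait(argv: list) -> int:
    """Run `argv` in a child process and return its exit code.

    A negative return value means the child was killed by that signal number.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; skip the parent's cleanup handlers.
            os._exit(127)
    # Parent: block until the child terminates, then decode the wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    code = spawn_and_wait([sys.executable, "-c", "import sys; sys.exit(7)"])
    print("child exited with", code)
```

In day-to-day code the subprocess module is usually the better choice. The point here is only that the underlying primitives are directly reachable.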

Myth 3: "Shared memory and IPC aren't accessible from Python." multiprocessing.shared_memory.SharedMemory (Python 3.8+) gives you named POSIX shared memory regions accessible from multiple processes. A 500 MB NumPy array can be shared across 16 workers with zero copies. The same data structures that a C++ inference server would share live in the same physical memory pages, mapped into each Python process.
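A minimal sketch of the attach-by-name mechanism, using only the standard library. To keep it self-contained, a second handle in the same process stands in for a worker process; the function name and the 8-byte payload are assumptions for illustration. With NumPy installed, the same buffer could back an array via np.ndarray(..., buffer=shm.buf).

```python
import struct
from multiprocessing import shared_memory


def roundtrip_through_shared_memory(value: int) -> int:
    """Write an integer into a named segment and read it back via a second handle."""
    owner = shared_memory.SharedMemory(create=True, size=8)
    try:
        # Pack directly into the mapped pages; no intermediate bytes object is kept.
        struct.pack_into("<q", owner.buf, 0, value)

        # A worker would do exactly this: attach to the segment by its name.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            (seen,) = struct.unpack_from("<q", peer.buf, 0)
        finally:
            peer.close()  # unmap this handle's view
        return seen
    finally:
        owner.close()
        owner.unlink()  # remove the name so the kernel can free the pages


if __name__ == "__main__":
    print(roundtrip_through_shared_memory(1234))
```

Only the segment name has to travel between processes. The payload itself is never pickled or copied.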

Myth 4: "If you need speed, you have to rewrite everything in C." You need to rewrite the hot path, which is typically a small fraction of the code (around 5%). ctypes, cffi, and the Python C API give you surgical access to native code. A C extension that eliminates one inner-loop allocation can yield speedups as large as 50x while leaving the rest of the codebase in clean Python.
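The lightest of these options is ctypes, which needs no compiler. The sketch below calls strlen from the C library already loaded into the interpreter process; it assumes Linux or macOS, where ctypes.CDLL(None) resolves symbols from the running process. The wrapper name c_strlen is hypothetical.

```python
import ctypes

# None means "the current process": libc is already mapped into it on Linux/macOS.
libc = ctypes.CDLL(None)

# Declare the C signature so ctypes converts arguments and results correctly:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of `data` up to its first NUL byte, as computed by libc."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"hello"))
```

Declaring argtypes and restype matters. Without them ctypes assumes int for the result, which silently truncates 64-bit values such as pointers and size_t.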

Myth 5: "Python's networking stack is just a wrapper around requests." It is the other way around: requests is a few thousand lines of Python on top of urllib3, which in turn builds on http.client and Python's socket module. The socket module is a direct wrapper around BSD socket(2). You can write a working HTTP server in 80 lines using raw sockets with no dependencies. Understanding this stack means you can debug connection resets, TLS errors, and keepalive failures at the source.
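To make the layering concrete, here is a rough sketch of one HTTP exchange done with nothing but the socket module. Both function names are hypothetical, the server handles exactly one request, and request parsing is reduced to reading the request line.

```python
import socket


def serve_one_request(listener: socket.socket) -> str:
    """Accept one connection, answer it with a plain-text response, return the request line."""
    conn, _addr = listener.accept()
    with conn:
        request = b""
        # Headers end at the first blank line; this sketch ignores request bodies.
        while b"\r\n\r\n" not in request:
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk
        request_line = request.split(b"\r\n", 1)[0].decode("latin-1")
        body = f"you sent: {request_line}\n".encode()
        head = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(body)}\r\n"
            "Connection: close\r\n"
            "\r\n"
        ).encode()
        conn.sendall(head + body)
    return request_line


def http_get(host: str, port: int, path: str = "/") -> bytes:
    """Send a bare GET request and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=5) as sock:
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: response complete
                break
            chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    import threading

    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=serve_one_request, args=(listener,)).start()
    print(http_get("127.0.0.1", port, "/hello").decode())
    listener.close()
```

Everything requests adds (connection pooling, redirects, TLS, encodings) sits on top of byte exchanges like this one.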

What This Module Covers

This module is a ground-up systems programming curriculum using Python as the interface. Each lesson maps to a specific OS or systems concept. By the end, you will be able to:

  • Write production-quality signal handlers that survive SIGTERM without data loss
  • Build a non-blocking TCP server using epoll that scales to thousands of connections
  • Diagnose file descriptor leaks using /proc/self/fd and the resource module
  • Share a 1 GB model weight array across 32 worker processes with zero serialisation overhead
  • Write a C extension from scratch, manage reference counts correctly, and release the GIL for CPU-bound sections

Lesson Map

Lesson  File                        OS/Systems Concept
01      OS Primitives in Python     POSIX process model, signals, fork/exec, resource limits
02      Sockets and Networking      BSD socket API, TCP/UDP, epoll, Unix domain sockets, TLS
03      File Descriptors and I/O    VFS, FD table, mmap, sendfile, epoll, zero-copy I/O
04      Shared Memory and IPC       Pipes, FIFOs, POSIX shm, semaphores, multiprocessing
05      Writing C Extensions        Python C API, reference counting, GIL release, ctypes, cffi

The Linux Process Model and Where Python Sits

Every Python interpreter instance is a standard Linux process. It has a PID, a virtual address space, a file descriptor table, signal dispositions, and resource limits - just like any other process. The interpreter is not a virtual machine in the Java sense; it does not abstract away the OS. It sits on top of it, with very thin wrappers.

┌──────────────────────────────────────────────────────────────────────┐
│  User Space                                                          │
│                                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                   Python Application Code                    │   │
│   │         (your_module.py, business logic, frameworks)         │   │
│   └────────────────────┬─────────────────────────────────────────┘   │
│                        │ Python function calls                       │
│   ┌────────────────────▼─────────────────────────────────────────┐   │
│   │                   CPython Standard Library                   │   │
│   │    os / signal / socket / mmap / ctypes / multiprocessing    │   │
│   └────────────────────┬─────────────────────────────────────────┘   │
│                        │ C function calls (libc, libpthread)         │
│   ┌────────────────────▼─────────────────────────────────────────┐   │
│   │               CPython Interpreter (python3.12)               │   │
│   │  eval loop, memory allocator, GC, GIL, module import system  │   │
│   └────────────────────┬─────────────────────────────────────────┘   │
│                        │ glibc / musl wrapper calls                  │
│   ┌────────────────────▼─────────────────────────────────────────┐   │
│   │                   C Library (glibc / musl)                   │   │
│   │       malloc, pthread, stdio, networking, locale, time       │   │
│   └────────────────────┬─────────────────────────────────────────┘   │
│                        │ syscall instruction (int 0x80 / syscall)    │
└────────────────────────┼─────────────────────────────────────────────┘

┌────────────────────────▼─────────────────────────────────────────────┐
│  Kernel Space                                                        │
│                                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                    System Call Interface                     │   │
│   │   read, write, open, socket, mmap, fork, execve, sigaction   │   │
│   └──────────┬────────────────────────────────────────┬──────────┘   │
│              │                                        │              │
│   ┌──────────▼──────────┐                  ┌──────────▼──────────┐   │
│   │ Virtual File System │                  │    Network Stack    │   │
│   │        (VFS)        │                  │  (TCP/IP, sockets)  │   │
│   │ - inode table       │                  │ - sk_buff           │   │
│   │ - dentry cache      │                  │ - socket buffers    │   │
│   │ - page cache        │                  │ - netfilter         │   │
│   └──────────┬──────────┘                  └──────────┬──────────┘   │
│              │                                        │              │
│   ┌──────────▼────────────────────────────────────────▼──────────┐   │
│   │                      Memory Management                       │   │
│   │   - page allocator, slab allocator                           │   │
│   │   - virtual memory areas (VMAs)                              │   │
│   │   - copy-on-write, demand paging                             │   │
│   │   - shared memory (shmem), tmpfs                             │   │
│   └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
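The per-process attributes listed above (PID, signal dispositions, resource limits) can all be read back from inside the interpreter. The following is a small sketch; describe_process is a hypothetical helper and the choice of fields is arbitrary.

```python
import os
import resource
import signal


def describe_process() -> dict:
    """Collect a few of the kernel-level attributes every Python process has."""
    stack_soft, stack_hard = resource.getrlimit(resource.RLIMIT_STACK)
    return {
        "pid": os.getpid(),            # this process
        "ppid": os.getppid(),          # the process that forked us
        "cwd": os.getcwd(),            # per-process working directory
        # SIG_DFL, SIG_IGN, or a Python callable installed as the handler
        "sigterm_disposition": signal.getsignal(signal.SIGTERM),
        "sigint_disposition": signal.getsignal(signal.SIGINT),
        "stack_limit_soft": stack_soft,
        "stack_limit_hard": stack_hard,
    }


if __name__ == "__main__":
    for name, value in describe_process().items():
        print(f"{name:>22}: {value}")
```

Each value comes from a single system call or from state CPython recorded when it installed its own handlers. The interpreter keeps no separate abstraction in between.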

The Process Address Space

When Python starts, the kernel creates a virtual address space for the process. This space contains:

High addresses (x86-64 Linux, simplified)
┌───────────────────────┐ 0xFFFFFFFFFFFFFFFF
│ Kernel space          │ (not accessible from user space)
├───────────────────────┤ 0x00007FFFFFFFFFFF
│ vDSO / vvar           │ virtual dynamic shared object (gettimeofday etc),
│                       │ typically mapped near the stack
├───────────────────────┤
│ Stack                 │ grows downward
│ (local variables,     │
│  call frames)         │
├───────────────────────┤
│ ↓ grows down          │
│                       │
├───────────────────────┤
│ Memory-mapped region  │ mmap regions, shared libraries (libc),
│ (mmap, .so files)     │ .so extension modules
├───────────────────────┤
│                       │
│ ↑ grows up            │
├───────────────────────┤
│ Heap                  │ Python's memory allocator lives here
│ (PyObject allocs)     │ (pymalloc + system malloc)
├───────────────────────┤
│ BSS segment           │ zero-initialized globals
├───────────────────────┤
│ Data segment          │ initialized globals
├───────────────────────┤
│ Text segment          │ python3.12 binary, read-only data
│ (read-only code)      │ (string literals)
└───────────────────────┘ 0x0000000000000000
Low addresses

Python's objects live in the heap. When you call mmap(), a new virtual memory area is inserted at an address chosen by the kernel. When you fork(), the entire address space is duplicated using copy-on-write - pages are not physically copied until either process writes to them.
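The difference between copy-on-write pages and a shared mapping is easy to observe. In this sketch (POSIX only, with a hypothetical function name) the child modifies both an ordinary Python object and an anonymous mmap region; only the second change is visible to the parent.

```python
import mmap
import os


def fork_memory_demo() -> tuple:
    """Return (private, shared) contents as seen by the parent after the child wrote to both."""
    private = bytearray(b"parent")           # ordinary heap memory: copy-on-write after fork
    shared = mmap.mmap(-1, mmap.PAGESIZE)    # anonymous mapping; MAP_SHARED is the Unix default
    shared[:6] = b"parent"

    pid = os.fork()
    if pid == 0:
        try:
            private[:] = b"child!"   # triggers a private page copy in the child
            shared[:6] = b"child!"   # writes to pages both processes map
        finally:
            os._exit(0)              # leave without running the parent's cleanup

    os.waitpid(pid, 0)               # make sure the child has finished writing
    result = (bytes(private), bytes(shared[:6]))
    shared.close()
    return result


if __name__ == "__main__":
    print(fork_memory_demo())  # (b'parent', b'child!')
```

One practical caveat: CPython updates reference counts inside the objects themselves, so merely reading objects in a forked child can dirty pages and trigger copies.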

The File Descriptor Table

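Every Python process normally starts with three open file descriptors inherited from its parent: 0 (stdin), 1 (stdout), 2 (stderr). Every successful open(), socket(), accept(), and pipe() call adds entries (pipe() adds two). mmap() does not add a descriptor; it maps a file that is already open: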

Process (PID 12345)
┌─────────────────────────────────────┐
│        File Descriptor Table        │
│                                     │
│ fd 0 ────────────────────────────── ├──► /dev/pts/0 (terminal stdin)
│ fd 1 ────────────────────────────── ├──► /dev/pts/0 (terminal stdout)
│ fd 2 ────────────────────────────── ├──► /dev/pts/0 (terminal stderr)
│ fd 3 ────────────────────────────── ├──► /var/log/app.log
│ fd 4 ────────────────────────────── ├──► socket:[12388] (TCP listen)
│ fd 5 ────────────────────────────── ├──► socket:[12401] (TCP client)
│ fd 6 ────────────────────────────── ├──► pipe:[99031] (IPC read end)
│ fd 7 ────────────────────────────── ├──► /dev/shm/model_weights
│ ...                                 │
└─────────────────────────────────────┘
(points to kernel file description table entries)

The default soft limit is commonly 1024 FDs per process on Linux (check with ulimit -n). In production you typically raise this to 65536 or higher. Running out at 3 AM is the subject of Lesson 03.
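Both the table and the limit can be inspected from Python. The sketch below lists the descriptors currently open in the process and reads the RLIMIT_NOFILE pair; open_fds is a hypothetical helper, and the /dev/fd fallback for systems without /proc (such as macOS) is an assumption of this example.

```python
import os
import resource


def open_fds() -> list:
    """Return the sorted list of file descriptors currently open in this process."""
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    fds = []
    for name in os.listdir(fd_dir):
        fd = int(name)
        try:
            os.fstat(fd)
        except OSError:
            # listdir() used a temporary descriptor of its own, already closed.
            continue
        fds.append(fd)
    return sorted(fds)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}")
    print("before pipe():", open_fds())
    read_end, write_end = os.pipe()   # one call, two new table entries
    print("after pipe(): ", open_fds())
    os.close(read_end)
    os.close(write_end)
```

Calling open_fds() periodically and watching the count grow is a crude but effective way to confirm a suspected descriptor leak.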

The Module Dependency Graph

The five lessons build on each other:

Lesson 01: OS Primitives
│
├──► fork/exec/signals/resource limits
│
▼
Lesson 02: Sockets ───────────────────────────────────────────────┐
│                                                                 │
├──► socket() is a file descriptor                                │
│                                                                 │
▼                                                                 │
Lesson 03: File Descriptors and I/O                               │
│                                                                 │
├──► mmap() used in both FD lesson and shared memory              │
│                                                                 ▼
▼                                                            Lesson 05:
Lesson 04: Shared Memory and IPC                            C Extensions
│                                                                 │
├──► SharedMemory + NumPy buffers ◄───────── buffer protocol ─────┘
│
└──► Pipes use file descriptors (Lesson 03)

Prerequisites

This module assumes you have completed Module 01 (CPython Internals) or have equivalent knowledge of:

  • How the CPython interpreter executes bytecode
  • The GIL: what it protects, when it is released
  • Python's memory model: reference counting, cyclic GC
  • The difference between CPU-bound and I/O-bound workloads

You should also be comfortable with:

  • Basic Linux command-line usage (ps, lsof, strace, top)
  • Reading C header files (for the C extensions lesson)
  • asyncio basics (async/await syntax)

Environment Setup

All examples in this module run on Linux (kernel 5.x+) or macOS. Where Linux-specific APIs are used (epoll, /proc, sendfile), it is noted explicitly. Some examples require Python 3.10+ for structural pattern matching; all require Python 3.8+ minimum for SharedMemory.

# Verify your Python version
python3 --version # should be 3.10+

# Install optional dependencies used in examples
python3 -m pip install numpy # for shared memory examples

# For the C extensions lesson
# macOS
xcode-select --install

# Linux (Debian/Ubuntu)
sudo apt-get install python3-dev build-essential

A Note on strace

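Throughout this module, the most useful debugging tool is strace (Linux only; dtruss is the closest macOS equivalent). It shows every system call your Python program makes:

# Trace a Python program, showing only network- and file-related syscalls
strace -e trace=network,file python3 my_server.py

# Show file descriptor operations specifically
# (modern Python and glibc use openat and accept4 rather than open and accept)
strace -e trace=read,write,open,openat,close,socket,accept,accept4 python3 server.py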

# Attach to a running process
strace -p 12345

# Count syscalls by frequency
strace -c python3 my_program.py

When something goes wrong in systems code - a connection that drops, a file that can't be opened, a process that hangs - strace shows you the exact syscall that failed and the errno that explains why. Before reading a stack trace, run strace. It will save you hours.
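The errno that strace prints is the same value Python attaches to the OSError it raises, so the two views can be cross-checked. A minimal sketch, with a hypothetical helper name and a deliberately missing path:

```python
import errno
import os


def errno_for_open(path: str):
    """Try to open `path` read-only; return (symbolic errno, message) on failure, else None."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # exc.errno is the raw value the failed syscall reported.
        return errno.errorcode[exc.errno], os.strerror(exc.errno)
    os.close(fd)
    return None


if __name__ == "__main__":
    print(errno_for_open("/nonexistent/path/for/demo"))
    print(errno_for_open("/"))
```

Under strace the first call appears as an openat(...) line ending in -1 ENOENT, matching the symbolic name returned here.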

© 2026 EngineersOfAI. All rights reserved.